Abstract

Typical learning curves for Soft Margin Classifiers (SMCs) learning 
both realizable and unrealizable tasks are determined using the tools of 
Statistical Mechanics. We derive the analytical behaviour of the learning 
curves in the regimes of small and large training sets. The generaliza- 
tion errors present different decay laws towards the asymptotic values 
as a function of the training set size, depending on general geometrical 
characteristics of the rule to be learned. Optimal generalization curves 
are deduced through a fine tuning of the hyperparameter controlling the 
trade-off between the error and the regularization terms in the cost func- 
tion. Even if the task is realizable, the optimal performance of the SMC 
is better than that of a hard margin Support Vector Machine (SVM) 
learning the same rule, and is very close to that of the Bayesian classifier. 
1 Introduction



The recently introduced Support Vector Machines (SVM) [?] may be
considered as an extension of the perceptron. The latter is only able
to perform linear separations by a hyperplane in input space. When
the problem is not linearly separable, instead of searching for more
complex surfaces in input space, SVMs map the input patterns onto
a space of much higher dimension, in the hope that the task becomes
linearly separable in this feature space. To cope with the problem of
the very high dimensionality of the space, Cortes and Vapnik [?]
proposed to find the Maximal Stability Perceptron (the solution that
maximizes the distance from the hyperplane to the closest pattern).
The corresponding weight vector, normal to the hyperplane, has the
remarkable property that it can be written as a linear combination of
some of the training patterns, called Support Vectors. This weight
vector $\mathbf{w}$ minimizes the SVM cost function,

$$E = \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w}, \tag{1}$$

subject to the following conditions, imposed on all the patterns
$\mu = 1, \dots, M$ of the training set
$\mathcal{L}_M = \{(\mathbf{x}^\mu, \tau^\mu)\}$, with
$\mathbf{x}^\mu \in \mathbb{R}^N$ and $\tau^\mu \in \{-1, 1\}$:

$$\tau^\mu (\mathbf{w}\cdot\mathbf{x}^\mu + b) \geq 1, \qquad \mu = 1, \dots, M, \tag{2}$$

where $|b|/\|\mathbf{w}\|$ is the distance of the hyperplane to the
origin. Conditions (2) impose that all the patterns be farther than a
distance $1/\|\mathbf{w}\|$ from the hyperplane, and minimization of
(1) ensures that this distance is maximized.

It can be shown that the solution to this extremum problem satisfies
$\mathbf{w} = \sum_{\mu=1}^{M} a_\mu \tau^\mu \mathbf{x}^\mu$, where
the coefficients $a_\mu$ are nonnegative and many of them vanish. If
we now introduce a mapping $\mathbf{x} \to \Phi(\mathbf{x})$, the
cost function for the SVM is given by eq. (1), with $\mathbf{x}^\mu$
replaced everywhere by $\Phi(\mathbf{x}^\mu)$.

These machines, called hard margin machines, have been successfully
analyzed within the approach of statistical mechanics [?], [?], [?].

The preceding formulation supposes that the task is linearly
separable in the working space, as otherwise conditions (2) cannot be
fulfilled for all the patterns $\mu$. This may arise either because
the selected mapping into the feature space is not adequate, or
because there is intrinsic noise in the data, and the task cannot be
learned without training errors. To cope with this problem, a
modification of the cost function (1) and the conditions (2) has been
suggested [?], giving rise to the concept of Soft Margin Classifier,
hereafter called SMC.

For simplicity, in this paper we restrict ourselves to consider 
SMCs acting on input space. This is a useful first step towards 
a better theoretical understanding of SMCs with full functional- 
ity, i.e. using a mapping to a high dimensional feature space. We 
consider classifiers without bias (see footnote 1).

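To find the SMC, one has to minimize the function

$$E_{C,k} = \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C \sum_{\mu=1}^{M} \xi_\mu^{k}, \tag{3}$$

where $k$ is a positive exponent and $C$ a positive constant, subject
to the following conditions for $\mu = 1, \dots, M$:

$$\tau^\mu\,\mathbf{w}\cdot\mathbf{x}^\mu \geq 1 - \xi_\mu, \tag{4}$$

$$\xi_\mu \geq 0. \tag{5}$$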

The slack variable $\xi_\mu$ is a measure of how much the constraint
(2) is violated for pattern $\mu$. In particular, if $\xi_\mu > 1$
then $\tau^\mu\,\mathbf{w}\cdot\mathbf{x}^\mu < 0$, which means that
pattern $\mu$ is wrongly classified. Unlike hard margin classifiers,
in which all the training patterns are excluded from a strip of width
$1/\|\mathbf{w}\|$ on both sides of the separating hyperplane, in the
case of SMCs, patterns with $0 < \xi_\mu < 1$ lie inside this strip,
called soft margin.

The patterns with $\xi_\mu > 0$, which are the training patterns that
are either wrongly classified or correctly classified but lying
within the above-mentioned strip, are the Support Vectors.
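The role of the slack variables can be illustrated with a minimal Python sketch. It assumes a bias-free classifier and uses the fact that, for a fixed weight vector, the cost (3) is minimized by the smallest slack compatible with (4) and (5). The function names and the toy patterns are hypothetical and chosen only for illustration.

```python
def slack(w, x, tau):
    """Smallest slack satisfying tau * (w . x) >= 1 - xi and xi >= 0."""
    margin = tau * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def categorize(xi):
    """Position of a pattern relative to the soft margin, from its slack."""
    if xi > 1.0:
        return "misclassified"
    if xi > 0.0:
        return "inside margin"
    return "outside margin"

def cost(w, patterns, C, k):
    """Cost function (3): (1/2) w.w + C * sum of slack**k."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return regularizer + C * sum(slack(w, x, tau) ** k for x, tau in patterns)

# Toy example (illustrative values only).
w = [2.0, 0.0]
patterns = [([1.0, 0.0], 1), ([0.25, 1.0], 1), ([0.5, -1.0], -1)]
slacks = [slack(w, x, tau) for x, tau in patterns]
labels = [categorize(xi) for xi in slacks]
```

In this toy example the second and third patterns have nonzero slack, so they would be the support vectors; only the third one is a training error.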

The exponent $k$ in (3) is usually set to 1 or 2 so that the cost
function is a quadratic function of the unknowns $\mathbf{w}$ and
$\xi_\mu$ ($\mu = 1, \dots, M$). Under these conditions, the minimum
of the cost function is unique [?], a fact that gives SVMs a big
advantage over other learning algorithms, which require a search for
the lowest of several local minima.

1 With this restriction, linearly separable tasks that would need a bias become
non-realizable. This allows us to explore a larger set of non-realizable rules.

For practical implementations it is useful to formulate the dual
problem and use the corresponding Kuhn-Tucker conditions, as was
done for our simulations. We do not go into further details here, as
these have been extensively discussed in the literature [?].

The value of the hyperparameter $C$ in (3) sets the compromise
between large margins and small numbers of errors. In practice,
$C$ should be adjusted, either by trial and error or using more
sophisticated methods [?], [?], to get the best performance out of
the classifier.
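The trade-off controlled by $C$ can be seen in a small numerical sketch. The simulations of the paper use the dual problem; the code below instead minimizes the primal cost (3) for $k = 2$ by plain gradient descent, which is a substitute technique chosen here only because it is short. The data set, the step size and all function names are assumptions made for the illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cost_k2(w, patterns, C):
    """Cost (3) with k = 2, slacks eliminated: xi = max(0, 1 - tau w.x)."""
    return 0.5 * dot(w, w) + C * sum(
        max(0.0, 1.0 - tau * dot(w, x)) ** 2 for x, tau in patterns)

def train_smc_k2(patterns, C, steps=5000, lr=0.01):
    """Gradient descent on the (convex, differentiable) k = 2 primal cost.

    The step size lr must stay below 2 / (1 + 2 C sum |x|^2) to converge.
    """
    n = len(patterns[0][0])
    w = [0.0] * n
    for _ in range(steps):
        grad = list(w)  # gradient of the term (1/2) w.w
        for x, tau in patterns:
            xi = max(0.0, 1.0 - tau * dot(w, x))
            for i in range(n):
                grad[i] -= 2.0 * C * xi * tau * x[i]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical, linearly separable toy data (no bias needed).
TOY = [([1.0, 0.2], 1), ([0.8, -0.3], 1), ([-1.0, 0.1], -1), ([-0.7, -0.4], -1)]
w_soft = train_smc_k2(TOY, C=0.1)    # small C: favours a small norm (wide margin)
w_stiff = train_smc_k2(TOY, C=10.0)  # large C: favours small slacks
```

With these values the weight vector obtained for the larger $C$ has the larger norm, that is, a narrower margin $1/\|\mathbf{w}\|$, in line with the compromise described above.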

The paper is organized as follows: in section 2 we clarify what
is meant by typical properties of a classifier and give a brief survey
of the method used. The learning curves for a variety of different
tasks are determined and discussed in section 3. We analyze
the problems of patterns whose classes are random variables
(section 3.1), patterns classified with a rule given by a teacher with
the same structure as the trained classifier (section 3.2), or a
different structure (section 3.3). We also consider tasks where the
patterns' classes are corrupted by noise (section 3.4). We relate the
different behaviours of the generalization error to simple geometrical
properties of the rule to be learned. We also present the typical
properties of the optimal generalizers, that is, those obtained with
the values of $C$ that minimize the generalization error. The main
results are summarized in the last section (4), where we discuss some
perspectives of this work.



2 What are typical properties?

Worst case analysis of a learning machine gives exact bounds for
different quantities of interest, like the generalization error, the
training error, the number of support vectors, etc. However, very
often these exact bounds are not tight. In this paper we focus on the
typical properties of SMCs faced with particular classes of problems.
Our results, obtained with the tools of statistical mechanics, predict
the expected (averaged over all the possible training sets) behaviour
of SMCs. As in the case of perceptrons, this approach allows one to
gain insight into the learning properties of the classifiers. The
method, thoroughly described in a recent book [?], has already been
presented elsewhere in the context of SVMs [?]. It was applied for the
first time to learning machines by E. Gardner [?], who studied a
perceptron learning a binary classification task. Statistical
Mechanics is generally used to determine the properties of the minima
of (cost) functions in very high dimensional spaces, when the cost
depends on a large number of random variables, the training patterns.
As in statistical learning theory [25], these are assumed to be drawn
independently from a probability distribution.

Schematically, training a classifier amounts to minimizing a cost
function (which plays the role of an energy) in the space of the
classifier's parameters, which in the case of SMCs are the
$N$-dimensional weight vector and the $M$ slack variables. This
minimization is done using the information contained in the set of
$M$ training patterns. The typical properties (training error,
generalization error, etc.) are obtained through averages over all the
possible training sets corresponding to the task, in the limit of very
large $N$ and $M$, keeping constant the ratio $M/N = \alpha$,
hereafter called training set size. It can be proved that in the limit
where both $N$ and $M$ diverge (called thermodynamic limit), with
$\alpha$ held fixed, these averages coincide with the value taken by
the considered property for almost every training set. This means
that, if we made the "experiment" of training a given machine with a
given training set for given (large enough) $N$ and $M$, we would find
that the value of, say, the generalization error is very close to the
one calculated with Gardner's method. Within this context, values of
$N \sim 50$ are usually already large. The rigorous validity of the
techniques used in these calculations has been established recently
[24].



3 Results 

As already stated, we consider perceptrons trained to minimize the
cost function (3) with conditions (4) and (5), with $C > 0$. In the
following we assume, for the sake of simplicity, that the components
of the input patterns are independently drawn from a gaussian
distribution of zero mean and variance $1/\sqrt{N}$:

$$P(\mathbf{x}) = \frac{e^{-\sqrt{N}\,\mathbf{x}^2/2}}{(2\pi/\sqrt{N})^{N/2}}. \tag{6}$$

We study binary classification tasks. The label of each pattern,
denoted by $\tau \in \{-1, 1\}$, is assigned following a rule, which
may be deterministic or stochastic. In the latter case the labels are
drawn from a probability distribution.

The training error $\epsilon_t$ is the fraction of patterns in the
training set that, after training, are incorrectly classified:

$$\epsilon_t = \frac{1}{M}\sum_{\mu=1}^{M}\Theta(-\gamma^\mu), \tag{7}$$

where $\Theta$ is the Heaviside step function and $\gamma^\mu$, the
stability of pattern $\mu$, is given in equation (11). The
generalization error is defined as the probability of misclassifying a
new pattern after the network has been trained:

$$\epsilon_g = \sum_{\tau=\pm 1}\int d\mathbf{x}\, P(\tau|\mathbf{x})\,P(\mathbf{x})\,\Theta(-\tau\,\mathbf{w}\cdot\mathbf{x}), \tag{8}$$

where $P(\tau|\mathbf{x})$ is the probability that pattern
$\mathbf{x}$ belongs to class $\tau$ (notice that for deterministic
rules this is a delta function). If the classifier cannot implement
the rule with a vanishing generalization error for any training set
size $\alpha$ (even for $\alpha \to \infty$), the rule is said to be
unrealizable. Otherwise, it is realizable.
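As a hedged illustration of definition (8), the sketch below estimates the generalization error by Monte Carlo sampling for the simplest deterministic case, a bias-free linear teacher. It then compares the estimate with the standard geometric result for isotropic gaussian inputs, $\epsilon_g = \arccos(R)/\pi$, where $R$ is the normalized overlap between student and teacher. The vectors, the sample size and the function names are assumptions; the variance of the inputs does not affect the signs and is set to 1 here.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def generalization_error_mc(w, w0, n_samples, rng):
    """Fraction of random gaussian patterns on which student and teacher disagree."""
    n = len(w)
    errors = 0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if dot(w0, x) * dot(w, x) < 0:
            errors += 1
    return errors / n_samples

def generalization_error_angle(w, w0):
    """arccos(R) / pi, with R the cosine of the angle between w and w0."""
    overlap = dot(w, w0) / math.sqrt(dot(w, w) * dot(w0, w0))
    overlap = max(-1.0, min(1.0, overlap))  # guard against rounding
    return math.acos(overlap) / math.pi

teacher = [1.0, 0.0, 0.0]
student = [1.0, 1.0, 0.0]  # 45 degrees away from the teacher
estimate = generalization_error_mc(student, teacher, 20000, random.Random(0))
exact = generalization_error_angle(student, teacher)
```

For a student at 45 degrees from the teacher the geometric formula gives one quarter, and the sampled estimate should agree within statistical fluctuations.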

The pertinence of our analytic results has been verified by
comparing them to numerical simulations. The latter, presented in the
figures of the following paragraphs, were done for $N = 100$ and
different values of $M$. This value of $N$ is large enough for $M/N$
to be a good approximation of $\alpha$, which in the theoretical
approach corresponds to the ratio $M/N$ in the limit $M \to \infty$,
$N \to \infty$.

We present results for different kinds of rules, starting with the
extreme case where there is no rule at all, following with the case
of a realizable rule, and ending with several examples of
non-realizable rules. We also report results for different exponents
$k$ and values of $C$, including the optimal value and the limiting
case $C \to \infty$, for each of the rules considered. We call optimal
the value $C_{opt}(\alpha)$ that gives, on average, the best
generalizer, i.e. the lowest generalization error $\epsilon_g$ for
each value of $\alpha$.

3.1 Patterns with random classes 

In this section we consider that the patterns are labelled randomly.
Thus, it is impossible for any classifier to predict the correct label
with any degree of certainty. Therefore, $\epsilon_g = 1/2$. Gardner
[?] analyzed with the statistical mechanics approach the properties of
a perceptron learning such a task using the number of training errors
$M\epsilon_t$ (see eq. (7)) as cost function. She obtained the average
training error $\epsilon_t$, which is the lowest curve in figure 1
(see footnote 2). This curve serves as a reference for the performance
of the SMC, as by definition it gives the lowest training error that
can be achieved on average. It is well known that $\epsilon_t = 0$ for
$0 < \alpha < 2$, which means that it is possible to train a
perceptron to classify correctly any set of training patterns only if
$\alpha < 2$. The value $\alpha_c = 2$ is the typical capacity of the
perceptron: for $\alpha > \alpha_c$ only a subset of zero measure
among all the possible training sets is learnable without training
errors: the linearly separable ones. That is, for $\alpha > 2$, a
typical training set has probability 1 of not being linearly
separable.
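The capacity $\alpha_c = 2$ can be made concrete with Cover's classical counting result, which is not part of the statistical mechanics calculation of this paper but leads to the same threshold. For $M$ points in general position in $N$ dimensions, the fraction of the $2^M$ random labelings that a hyperplane through the origin can separate is $2^{1-M}\sum_{k=0}^{N-1}\binom{M-1}{k}$. The following sketch (function name hypothetical) evaluates it exactly.

```python
from math import comb

def separable_fraction(M, N):
    """Cover's counting formula: fraction of the 2**M labelings of M points
    in general position in N dimensions that are linearly separable by a
    hyperplane through the origin."""
    return 2 * sum(comb(M - 1, k) for k in range(N)) / 2 ** M

N = 100
below = separable_fraction(150, N)     # alpha = 1.5
at_capacity = separable_fraction(200, N)  # alpha = 2
above = separable_fraction(250, N)     # alpha = 2.5
```

At $M = 2N$ the fraction is exactly one half for every $N$, and for $N = 100$ it is already close to 1 at $\alpha = 1.5$ and close to 0 at $\alpha = 2.5$, so the transition sharpens into a step at $\alpha_c = 2$ as $N$ grows.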

In figure 1 we show our results for the SMCs. For finite values
of $C$ the average training error does not vanish at any $\alpha$, as
was to be expected from the fact that SMCs do not aim at minimizing
this quantity. Moreover, the fraction of training errors is rather
large compared to the minimal possible values. The best performance is
obtained in the limit $C \to \infty$. In this limit, the SMC achieves
the perceptron's maximal capacity: its training error vanishes both
with $k = 1$ and $k = 2$ if $\alpha < \alpha_c$. For
$\alpha > \alpha_c$ the best performance is obtained using the
exponent $k = 1$ in the cost function (3). This result is further
discussed in the conclusions. Notice that when $k = 1$ the fraction of
errors leaves the value 0 continuously, while the curve for $k = 2$
presents a discontinuity at $\alpha = \alpha_c$, where it jumps from
$\epsilon_t = 0$ to a finite value $\epsilon_t = 0.105$.

2 Although this curve is not exact, as the approximations used to obtain it break
down for $\alpha > 2$, the corrections to it are believed to be small [9].
Figure 1: Average value of the training error for different SMCs learning a random
task. The points correspond to the results of simulations, averaged over about 100
different training sets.

Figure 2: Average value of the training and generalization errors for different SMCs
learning a realizable task. The curve for $C_{opt}$ almost coincides with the one for
$C \to \infty$ for $k = 1$ and with that of the Bayesian classifier for $k = 2$. The
points correspond to simulations, averaged over the number of training sets necessary
to ensure that the error bars are smaller than the symbols.



3.2 A realizable rule 

Consider now a linearly separable rule so that, at least in principle,
a perceptron is able to achieve a vanishing generalization error. To
ensure that the rule is realizable, the labels of the examples are
given by another perceptron, called teacher:

$$\tau = \mathrm{sign}(\mathbf{w}_0\cdot\mathbf{x}), \tag{9}$$

where $\mathbf{w}_0$ is the teacher's weight vector. The average
training and generalization errors for $k = 1$ and $k = 2$ are
represented in figure 2, for different values of $C$.

Even though we are considering a realizable rule, which means
that it is always possible to find a classifier achieving
$\epsilon_t = 0$, all the SMCs (with finite $C$) end up with a finite
fraction of training errors.
SMCs with $k = 2$ perform better, both in training and in
generalization, than those with $k = 1$. This is so because the term
proportional to $\xi_\mu^k$ in the cost function (3) penalizes the
errors (which have $\xi_\mu > 1$) more heavily when $k = 2$.

The generalization error of the SMCs has a non-monotonic behaviour as
a function of $C$. On increasing $C$ at any given $\alpha$,
$\epsilon_g$ first decreases, reaches a minimum value for
$C = C_{opt}(\alpha)$, and increases for larger values of $C$. In the
limit $C \to \infty$, we obtain the hard margin solution, which has a
larger $\epsilon_g$ than the optimum.

It is interesting to notice that with $C = C_{opt}(\alpha)$, which
gives (by definition) the smallest generalization error, the
corresponding training errors are not minimal. A similar result has
been obtained [7] for a perceptron learning with the algorithm
Minimerror [15], which minimizes a temperature dependent logistic cost
function. In that case, the parameter that plays the role of $C$ is
the temperature. It was shown that in the limit of zero temperature
Minimerror converges to the maximal margin perceptron, which is
nothing but the hard margin SVM, with $\epsilon_t = 0$. However, at
finite temperature, the algorithm allows one to obtain better
generalization performance than the hard margin SVM, at the price of
making training errors.

For the sake of comparison we included in the same figures the
generalization error of the bayesian perceptron learning a realizable
rule [13], which is known to be the optimal generalizer. We see that
for $C = C_{opt}$ the best SMC is obtained with $k = 2$. The relative
difference of its generalization error with respect to the bayesian
one is at most 1.7% (see fig. 3). Thus, very good generalization is
achieved at the expense of some training errors.

In the limit of very large values of $\alpha$, the generalization
error (whose behaviour coincides in all cases with that of the
training error) presents different behaviours. For all finite values
of $C$, and both for $k = 1$ and $k = 2$,
$\epsilon_g \sim C^{-1/6}\alpha^{-2/3}$. That is, $\epsilon_g$
decreases with the training set size more slowly than the usual
$\alpha^{-1}$ law, found in the literature for zero training error
solutions. This behaviour changes qualitatively if $C \to \infty$, in
which limit we recover the well known hard margin result
$\epsilon_g \simeq 0.5005/\alpha$. For the curves obtained using
$C_{opt}$ we obtain $\epsilon_g(k=1) \simeq 0.488\,\alpha^{-1}$ and
$\epsilon_g(k=2) \simeq 0.449\,\alpha^{-1}$. Despite the fact that
$C_{opt}$ is finite, the behaviour in this case is proportional to
$\alpha^{-1}$ because $C_{opt}$ depends on $\alpha$. These results are
to be compared with the behaviour of the bayesian classifier for large
$\alpha$: $\epsilon_g \simeq 0.442\,\alpha^{-1}$.

As will be discussed later, the expected value of the first term in
the cost function (3) is important to understand the behaviour of
the learning curves of SMCs. In the present case of a realizable rule
we obtain, for large $\alpha$,
$\langle\|\mathbf{w}\|\rangle/\sqrt{N} \sim (C\alpha)^{1/3}$.

Figure 3: Comparison between the generalization error for the SMC with $C_{opt}$ and
$k = 2$ and the generalization error for the bayesian classifier.
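The asymptotic laws quoted above can be compared numerically with a few lines of Python. The prefactors of the $1/\alpha$ laws are the ones given in the text; the proportionality constant of the finite-$C$ law is not given, so it is set to 1 below purely as an assumption, and only ratios that do not depend on it are meaningful.

```python
# Prefactors a in the large-alpha law eps_g ~ a / alpha, as quoted in the text.
PREFACTORS = {
    "hard margin": 0.5005,
    "C_opt, k=1": 0.488,
    "C_opt, k=2": 0.449,
    "bayes": 0.442,
}

def eg_one_over_alpha(name, alpha):
    return PREFACTORS[name] / alpha

def eg_finite_C(alpha, C):
    """Finite-C law C**(-1/6) * alpha**(-2/3); overall constant assumed to be 1."""
    return C ** (-1.0 / 6.0) * alpha ** (-2.0 / 3.0)

# Relative excess of the optimally tuned k = 2 SMC over the bayesian classifier.
relative_excess = (PREFACTORS["C_opt, k=2"] - PREFACTORS["bayes"]) / PREFACTORS["bayes"]

# The finite-C curve falls behind the hard margin one by a factor growing as alpha**(1/3).
def lag(alpha, C=1.0):
    return eg_finite_C(alpha, C) / eg_one_over_alpha("hard margin", alpha)

lag_growth = lag(1000.0) / lag(1.0)
```

The asymptotic relative excess is about 1.6%, compatible with the bound of 1.7% quoted for all $\alpha$, and the ratio between the finite-$C$ and hard margin errors grows tenfold when $\alpha$ increases by a factor of 1000.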



3.3 Deterministic unrealizable rules 

Unrealizable rules are either deterministic, given by "teachers" that
have a more complex structure than the "student", or stochastic,
which are inherently unrealizable because of the randomness involved.

We have studied in great detail the behavior of SMCs facing some
deterministic unrealizable tasks elsewhere [?]. Here, we only
summarize our results. We consider rules corresponding to several
parallel separating hyperplanes, like those sketched in figure 4. The
class of an input vector $\mathbf{x}$ is given by

$$\tau = \mathrm{sgn}(\mathcal{P}(\mathbf{w}_0\cdot\mathbf{x})), \tag{10}$$

where $\mathcal{P}$ is, in principle, any function of its argument. In
fact, the rule (10) only depends on the zeros of $\mathcal{P}(z)$,
where it changes sign, and not on its particular expression. We
therefore assume, without loss of generality, that it is a polynomial.
If it has $m$ zeros, the rule (10) corresponds to a set of $m$
parallel discriminating hyperplanes defined by the equations
$\mathbf{w}_0\cdot\mathbf{x} - z_i = 0$, where
$\{z_i : i = 1, \dots, m\}$ are the zeros (see footnote 3) of
$\mathcal{P}(z)$. The distance to the origin of each hyperplane is
$|z_i|/\|\mathbf{w}_0\|$.

Figure 4: Three nonlinear rules. A) $\mathcal{P}(z) = (z - \delta)$, B)
$\mathcal{P}(z) = z(z - \delta)$, C) $\mathcal{P}(z) = (z - \delta)z(z + \delta)$.

One quantity of interest is the distribution of stabilities of the $M$
training patterns with respect to the SMC solution, defined by

$$\gamma^\mu = \frac{\tau^\mu\,\mathbf{w}\cdot\mathbf{x}^\mu}{\|\mathbf{w}\|}. \tag{11}$$

If $\gamma^\mu > 0$ ($< 0$) the pattern $\mu$ is correctly
(incorrectly) classified. The norm of $\gamma^\mu$ is the distance of
pattern $\mu$ to the hyperplane orthogonal to $\mathbf{w}$ that
contains the origin of coordinates. The support vectors have
$\gamma^\mu < 1/\|\mathbf{w}\|$. The distribution of stabilities of
the training patterns, averaged over the possible training sets,

$$\rho(\gamma) = \left\langle \frac{1}{M}\sum_{\mu=1}^{M}\delta(\gamma - \gamma^\mu)\right\rangle, \tag{12}$$

gives useful information regarding the SMC's solution. For the
considered rules, the distribution of stabilities for any finite
$\alpha$ is nonvanishing everywhere, and has a single discontinuity at
$\gamma = \sqrt{N}/\|\mathbf{w}\|$. In addition, if $k = 1$ there is a
Dirac delta at this position, indicating that there is a finite
fraction of training patterns placed exactly at the SMC's margin. The
absence of such a delta peak for $k = 2$ is related to the fact that
these patterns do not belong to the support vectors in this case (see
footnote 4), whereas they are support vectors if $k = 1$. The training
error and the average fraction of support vectors are obtained by
performing the integral of $\rho(\gamma)$ between $-\infty$ and 0, and
between $-\infty$ and $\sqrt{N}/\|\mathbf{w}\|$, respectively. In
fig. 5 we show an example of the distribution of stabilities for one
particular unrealizable rule.

Figure 5: Distribution of stabilities for two SMCs learning patterns given by
the rule with $\mathcal{P}(z) = z(z - 2)$. The full lines represent simulations
averaged over 100 training sets, for 200 examples with $N = 100$. The dashed lines
are the theoretical predictions for $\alpha = 2$.
A) $k = 1$. The arrow shows the position of the Dirac delta as predicted by the
theory.
B) $k = 2$.

3 Here, as in the rest of the paper, we call zeros the points where the function
changes sign.

We analyzed three types of rules: the linear rule given by
$\mathcal{P}(z) = z - \delta$, the "sandwich" rule
$\mathcal{P}(z) = z(z - \delta)$ and the "reversed-wedge" rule
$\mathcal{P}(z) = (z - \delta)z(z + \delta)$. The corresponding
learning curves present very different behaviours depending on the
rules, but, given the rule, they are qualitatively similar for both
exponents, $k = 1$ and $k = 2$.

For the reversed wedge rule with
$\delta > \delta_c = \sqrt{2\ln 2}$ and for the sandwich rule,
$\epsilon_t(\alpha)$ approaches its finite asymptotic value for
$\alpha \to \infty$ from above, and $\epsilon_g$ decreases rapidly at
small $\alpha$, when $\epsilon_t$ is still relatively large. In both
cases the best generalizer (with $C = C_{opt}(\alpha)$) is obtained
with the exponent $k = 1$. Rules of this kind are hereafter called
rules of type I.

For the reversed wedge rule with
$\delta < \delta_c = \sqrt{2\ln 2}$, and the linear rule, hereafter
called rules of type II, we find that $\epsilon_t$ approaches its
asymptotic value from below, whilst the decrease of $\epsilon_g$ at
small $\alpha$ is slower than for rules of type I. The best
generalizer (with $C = C_{opt}(\alpha)$) for rules of type II is
obtained with exponent $k = 2$. Figure 6 presents the different kinds
of behaviours for some rules of type I and II. The behaviour of
$C_{opt}$ as a function of $\alpha$ is also very different for each
type of rule.

4 This results from the analysis of the Kuhn-Tucker equations.

In fact, the two types of rules may be characterized by which
patterns are necessarily misclassified by the "student" in the limit
$\alpha \to \infty$. In this limit, the weight vector of the SMC tends
to be aligned either parallel or antiparallel to $\mathbf{w}_0$, the
normal to the discriminant hyperplanes corresponding to the rule. This
is represented in figure 7, where the misclassified patterns lie in
the shaded regions. For rules of type I, these regions are unbounded
half-spaces. On the other hand, for type II rules errors are
restricted to the bounded regions close to the origin. This remark
allows one to understand why the best generalization performance is
obtained with different exponents $k$ depending on the type of rule.
In general, the student's (unique) hyperplane is rotated with respect
to the teacher's vector $\mathbf{w}_0$ by an angle that depends on the
type of rule and on the exponent $k$. If the training errors lie in
the unbounded regions, the rotation angle with $k = 2$ will be larger
than with $k = 1$ because this reduces the cost of the errors located
far from the hyperplane. If this kind of error cannot be avoided, as
happens for rules of type I, the generalization error with $k = 2$
will be larger than with $k = 1$. On the other hand, when the
unavoidable errors are relatively close to the origin of coordinates,
their larger cost when using $k = 2$ will induce an orientation of the
hyperplane closer to $\mathbf{w}_0$ than with $k = 1$. This results in
a better generalization performance with $k = 2$ than with $k = 1$.

Figure 6: Comparison of the generalization errors of SMCs with $k = 1$ and
$k = 2$ learning unrealizable rules. Figures A and B correspond to rules of type
I and figures C and D correspond to rules of type II. (Recoverable plot labels:
sandwich rule, $\delta = 2$; curves for $k = 1$ and $k = 2$; horizontal axis
$\alpha$.)

To conclude this study of deterministic unrealizable rules, we
discuss some general results. In particular, the different rules can
be characterized by a single quantity:

$$\langle\gamma_0\rangle = \int d\gamma\,\gamma\,P(\gamma) = \int Dz\, z\,\mathrm{sign}(\mathcal{P}(z)) = \sqrt{\frac{2}{\pi}}\sum_{i=1}^{m}(-1)^{m-i}\,e^{-z_i^2/2}, \tag{13}$$

where $Dz = e^{-z^2/2}dz/\sqrt{2\pi}$ and the $z_i$ are the zeros of
$\mathcal{P}(z)$, taken here in increasing order with
$\mathcal{P}(z) > 0$ beyond the largest zero.
$\langle\gamma_0\rangle$ represents the average stability of the
training patterns with respect to a hyperplane normal to
$\mathbf{w}_0$ passing through the origin.

One interesting result is that if the rule is such that
$\langle\gamma_0\rangle = 0$ then the SMC (for all values of $C$ and
$k$) is unable to generalize: $\epsilon_g = 1/2$ for all values of
$\alpha$, even though the training error remains small. This
phenomenon is known as memorization without generalization [?]. A
similar result has been obtained by Reimann and van den Broeck [16]
when the classifier uses Hebb's rule.
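The quantity $\langle\gamma_0\rangle = \int Dz\, z\,\mathrm{sign}(\mathcal{P}(z))$ is easy to evaluate numerically. The sketch below assumes a polynomial with simple zeros and positive leading coefficient, and compares a midpoint-rule integration with the closed form $\sqrt{2/\pi}\sum_i (-1)^{m-i} e^{-z_i^2/2}$ (zeros in increasing order), which follows from integrating $z\,e^{-z^2/2}$ piecewise between consecutive zeros. Function names, grid size and cutoff are hypothetical choices.

```python
import math

def sign_poly(z, zeros):
    """Sign of P(z) = prod_i (z - z_i), with leading coefficient +1."""
    s = 1
    for zi in zeros:
        if z < zi:
            s = -s
    return s

def mean_stability_numeric(zeros, zmax=8.0, n=40000):
    """Midpoint rule for the gaussian integral of z * sign(P(z))."""
    h = 2.0 * zmax / n
    total = 0.0
    for j in range(n):
        z = -zmax + (j + 0.5) * h
        total += z * sign_poly(z, zeros) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def mean_stability_closed(zeros):
    """sqrt(2/pi) * sum_i (-1)**(m-i) * exp(-z_i**2 / 2), zeros sorted increasingly."""
    zs = sorted(zeros)
    m = len(zs)
    return math.sqrt(2.0 / math.pi) * sum(
        (-1) ** (m - 1 - i) * math.exp(-0.5 * z * z) for i, z in enumerate(zs))

delta_c = math.sqrt(2.0 * math.log(2.0))
linear = mean_stability_closed([1.0])                   # P(z) = z - 1
sandwich = mean_stability_closed([0.0, 2.0])            # P(z) = z (z - 2)
wedge_at_dc = mean_stability_closed([-delta_c, 0.0, delta_c])
```

For the reversed-wedge rule the closed form vanishes exactly at $\delta = \sqrt{2\ln 2}$, the value $\delta_c$ that separates the two types of behaviour; the linear rule gives a positive value and the sandwich rule a negative one.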

For large values of $\alpha$ the weight vector of the SMC tends to
align parallel to $\mathbf{w}_0$ if $\langle\gamma_0\rangle > 0$ and
antiparallel if $\langle\gamma_0\rangle < 0$. Notice that this does
not imply that the best generalization performance is reached
asymptotically for $\alpha \to \infty$, as it can be shown that
$\epsilon_g$ is not a monotonic function of $\alpha$. The asymptotic
value of the generalization error is

$$\epsilon_g^{\infty} \equiv \epsilon_g(\pm\mathbf{w}_0) = \int Dz\,\Theta(\mp z\,\mathcal{P}(z)), \tag{14}$$

which can be even larger than 1/2. This means that using the SMC
for large $\alpha$ can be worse than classifying the patterns
randomly. The asymptotic behaviour of $\epsilon_g$ only depends on
whether $\mathcal{P}(0) = 0$ or $\mathcal{P}(0) \neq 0$, that is,
whether one of the rule's hyperplanes contains the origin or not. If
$\mathcal{P}(0) = 0$, we obtain a power law:
$\epsilon_g - \epsilon_g^{\infty} \propto \alpha^{-1/2}$. The same
exponent was found by Fontanari and Meir [12] for a machine learning a
realizable rule with an algorithm accepting training errors. If
$\mathcal{P}(0) \neq 0$ the convergence is exponential:
$\epsilon_g - \epsilon_g^{\infty} \propto \alpha^{3} e^{-[\ldots]}$,
with an exponent that grows with $\alpha$ and depends on $z_0$, the
zero of $\mathcal{P}(z)$ with smallest norm. The existence of these
two regimes is related to whether the patterns contributing to
$\epsilon_g$ lie close to or far from the student's hyperplane.
Figure 8 presents examples of both regimes. The angle $\beta \ll 1$
shown in the figure gives the orientation of the SMC's hyperplane
relative to that of the teacher, for large $\alpha$. Within our
approach, this angle can be calculated as a function of $\alpha$. In
fig. 8A, we show the situation corresponding to the rule with
$\mathcal{P}(z) = z(z - \delta)$. Consider the difference
$\epsilon_g(\mathbf{w}_0) - \epsilon_g(\mathbf{w})$ of the
generalization error with respect to its asymptotic value. It is
proportional to the fraction of patterns that lie in the dark grey
regions, because the contributions of the light grey regions
compensate each other. As the patterns' distribution is gaussian, the
main contribution is due to the fraction of points close to the
origin, which is roughly proportional to the angle $\beta \ll 1$.
In fig. 8B, we show the rule with $\mathcal{P}(z) = z - \delta$. Here
again, the difference of generalization error is given by the points
inside the dark grey regions, which in this case are placed far from
the origin (the contributions from the light grey regions compensating
each other). The fraction of points in this region is roughly
proportional to $\exp(-(\delta/\beta)^2/2)$.

Figure 7: Comparison of the regions with errors for $\alpha \to \infty$, of SMCs
learning:
A) Sandwich rule (type I),
B) Reversed wedge rule, with $\delta > \delta_c$ (type I),
C) Linear rule (type II),
D) Reversed wedge rule, with $\delta < \delta_c$ (type II).
The arrows indicate the asymptotic orientation of the hyperplane of the SMC.
In the shaded region the patterns are incorrectly classified.

Figure 8: Comparison of two rules. The thick horizontal lines represent the
hyperplanes of the teacher. The signs + and - are the classes assigned by
the teacher inside the horizontal bands shown. The line perpendicular to
$\mathbf{w}$ represents the SMC's hyperplane. For $\alpha \gg 1$, the angle $\beta$
between $\mathbf{w}$ and $\mathbf{w}_0$ is $\ll 1$. The shaded regions contain the
misclassified examples.
A) Rule with $\mathcal{P}(z) = z(z - \delta)$.
B) Rule with $\mathcal{P}(z) = z - \delta$.
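A short numerical check of the asymptotic error can be sketched as follows. It assumes that the student is exactly parallel or antiparallel to the teacher direction, so that it outputs $\pm\mathrm{sign}(z)$ with $z$ a standard gaussian variable, and it sums the gaussian weight of the intervals on which this output differs from $\mathrm{sign}(\mathcal{P}(z))$. Function names are hypothetical, and the polynomial is assumed to have simple zeros and positive leading coefficient.

```python
import math

def gauss_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sign_poly(z, zeros):
    """Sign of P(z) = prod_i (z - z_i), with leading coefficient +1."""
    s = 1
    for zi in zeros:
        if z < zi:
            s = -s
    return s

def asymptotic_error(zeros, parallel=True):
    """Gaussian weight of the region where sign(+-z) differs from sign(P(z))."""
    cuts = sorted(set(zeros) | {0.0})  # points where either sign can change
    edges = [-math.inf] + cuts + [math.inf]
    error = 0.0
    for a, b in zip(edges, edges[1:]):
        # Any interior point decides the whole interval.
        if a == -math.inf:
            mid = b - 1.0
        elif b == math.inf:
            mid = a + 1.0
        else:
            mid = 0.5 * (a + b)
        student = 1 if mid > 0 else -1
        if not parallel:
            student = -student
        if student != sign_poly(mid, zeros):
            error += gauss_cdf(b) - gauss_cdf(a)
    return error

wedge = asymptotic_error([-1.0, 0.0, 1.0], parallel=True)   # delta = 1 < delta_c
sandwich = asymptotic_error([0.0, 2.0], parallel=False)      # delta = 2
linear = asymptotic_error([1.0], parallel=True)              # delta = 1
```

For the reversed-wedge rule with $\delta = 1$ the errors occupy the bounded strip $|z| < \delta$ and the asymptotic error is about 0.68, an instance of the case where the large-$\alpha$ classifier is worse than random guessing.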

The constants involved in the asymptotic terms do not depend on
$C$. This can be understood from the fact that in the limit of large
$\alpha$, the complexity term $\mathbf{w}^2/2N$ in the cost function
tends to a constant and therefore it is the error term that dominates
completely, thus turning $C$ into a multiplicative constant of the
cost. Notice that this is not the case for the realizable rule, where
we have that $\mathbf{w}^2/N \to \infty$ if $\alpha \to \infty$.

We have also calculated the behavior of the quantities of interest
in the limit of a very small number of examples, $\alpha \ll 1$. We
find that the norm of the weight vector increases with $\alpha$ like
$\|\mathbf{w}\|/\sqrt{N} \propto C\sqrt{\alpha}$. The generalization
error decreases as
$1/2 - \epsilon_g \propto \langle\gamma_0\rangle^2\sqrt{\alpha}$. This
interesting result shows that for a small number of examples the
SMC has some generalization power even for those rules where the
asymptotic value of $\epsilon_g$ for large $\alpha$ is worse than if
the new inputs were randomly classified.

Another interesting result is that for the SMC with $k = 1$, the
fraction of support vectors that lie exactly on the margin tends to
1 for $\alpha \to 0$ if $C > 1$, but to 0 if $C < 1$.



3.4 Stochastic rules 

In this section we consider rules that do not determine the class of
the patterns uniquely. In particular, we study rules where the
output of the teacher is corrupted by random noise. When the
noise is additive the class of pattern $\mathbf{x}$ is

$$\tau = \mathrm{sgn}(\mathcal{P}(\mathbf{w}_0\cdot\mathbf{x} + \eta)), \tag{15}$$

where $\eta$ is a random variable drawn from a distribution which we
assume to be a gaussian of variance $\Delta$. If the noise is
multiplicative, the class is given by

$$\tau = \mathrm{sgn}(\mathcal{P}(\mathbf{w}_0\cdot\mathbf{x}\,\eta)), \tag{16}$$

where $\eta = \pm 1$ with $P(\eta = 1) = p$ and
$P(\eta = -1) = 1 - p$. The effect of these two types of noise is the
same as that of a corruption of the pattern to be classified. The
gaussian noise only changes the class of patterns that are close to
the separating hyperplanes, whereas with the multiplicative noise the
probability of changing the class of a pattern with respect to the
deterministic rule does not depend on distances. The effect of
additive noise has been studied using the statistical mechanics
approach in the case of a perceptron learning a rule with
$\mathcal{P}(z) = z$ by Gyorgyi and Tishby [?] and by Opper and
Haussler [13] in their analysis of the bayesian perceptron.
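The different geometry of the two kinds of noise can be illustrated for the simplest rule, $\mathcal{P}(z) = z$, with $z = \mathbf{w}_0\cdot\mathbf{x}$ the teacher's field. The sketch below computes the probability that the noisy label differs from the noiseless one as a function of $z$. The function names are hypothetical, and the additive noise is parametrized directly by its variance, following the wording of the text.

```python
import math

def flip_prob_additive(z, variance):
    """P(sgn(z + eta) != sgn(z)) for eta gaussian with zero mean and given variance."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2.0 * variance))

def flip_prob_multiplicative(z, p):
    """P(sgn(z * eta) != sgn(z)) for eta = +1 with probability p, -1 otherwise."""
    return 1.0 - p

fields = [0.0, 0.5, 1.0, 3.0]
additive = [flip_prob_additive(z, 1.0) for z in fields]
multiplicative = [flip_prob_multiplicative(z, 0.9) for z in fields]
```

The additive flip probability equals one half on the teacher's hyperplane and decays quickly with the distance to it, whereas the multiplicative one is the same everywhere, which is why the first produces bounded error regions and the second unbounded ones.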

The asymptotic behaviors we obtain for the SMCs learning noisy
rules are qualitatively the same as the ones we obtained for
unrealizable rules. The relative norm of the classifier,
$\|\mathbf{w}\|/\sqrt{N}$, tends to a constant value when
$\alpha \to \infty$ for both types of noise. The alignment of the
weight vector with the vector of the rule depends on the sign of
$\langle\gamma_0\rangle$.

The generalization error shows different asymptotic behaviours, 
depending on the type of noise. For multiplicative noise we obtain 
a power law behaviour if one of the hyperplanes of the rule contains 
the origin of coordinates and an exponential decay otherwise, like 
in the case of unrealizable rules. On the other hand, for gaussian 
additive noise, the rate of convergence does not depend on V(0) and 
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is a power law for all the rules: e g — ~ 9L ^- YhL%(~ 1)* exp(— |^) 

where $m$ is the number of zeros of V(z) and $\psi = \sqrt{1 + \Delta^2}/\Delta$. The 
suppression of the exponential convergence that exists in the case 
of a deterministic unrealizable rule with no teacher's hyperplane 
passing through the origin can be understood in the same way as 
before. The important point is that additive noise alters the class 
of patterns that are close to the hyperplane. This introduces errors 
inside the central strip which are enough to suppress the exponential 
convergence, which is recovered in the limit of vanishing noise, when 
$\Delta \to 0$. Notice that the multiplicative noise does not introduce any 
errors inside the central band. 

The learning curves for the particular case of rules with V(z) = z 
with multiplicative noise are shown in figure 9. Figure 10 presents 
results for gaussian additive noise. We can see for these rules the 
same effects we observed for deterministic rules of type I and II 
respectively. Geometrically, the reason is clear: as the gaussian 
noise changes mostly the class of the patterns that are near the 
hyperplanes of the teacher, the errors made in this case will be 
bounded, whereas for multiplicative noise the alteration of the class 
does not depend on the distance, producing unbounded errors. 

4 Conclusions 

We have studied the typical properties of the Soft Margin Classifiers, 
using the tools of Statistical Mechanics, for several different 
scenarios. This approach allowed us to study also the properties of 
the optimal classifier, which is the one obtained when the hyperpa- 
rameter C is tuned to obtain the lowest value of the generalization 
error. It turns out that, for realizable rules, the classifier obtained 
with $C_{opt}$ is very close to the optimal performance, given by Bayesian 
learning. The best results are obtained with an exponent k = 2 for 
the slacks in the cost function; the relative difference between both 
learning curves is smaller than 1.7%, for all $\alpha$. 

As the generalization error cannot be known exactly in practice, 
the values of $C_{opt}$ that we have obtained are only useful to provide a 
reference generalization error curve, against which the performance 
of the various algorithms devised to optimize C may be tested. 
Figure 9: Training and generalization errors for an SMC learning a rule with 
V(z) = z corrupted by multiplicative noise ($P(\eta = 1) = 0.922$). The horizontal 
line shows the asymptotic value of both errors. The symbols represent simula- 
tions made with N=100, averaged over enough training sets to make the error 
bars smaller than the symbols. 

Figure 10: Training and generalization errors for an SMC learning a rule with 
V(z) = z corrupted by additive Gaussian noise ($\eta = 0.97$). The horizontal line 
shows the asymptotic value of both errors. The symbols represent simulations 
made with N=100, averaged over enough training sets to make the error bars 
smaller than the symbols. 
In general, when the rule is non-realizable and the SMC cannot 
avoid misclassification of some training patterns, the learning curves 
present two types of behaviour. In rules of type I, in which these 
unavoidable errors lie at large distances from the SMC's hyperplane, 
the generalization performance is better if k = 1 is used as exponent 
of the slack variables in the cost function. Conversely, if errors are 
confined to a strip containing the origin, as in rules of type II, it 
may be more convenient to use an SMC with k = 2. This is due to 
the way errors are weighted in the cost function, and may be used as 
a rule of thumb for applications, as its only criterion is the 
distances of the misclassified patterns to the discriminating surface. 
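
The different weighting of errors for k = 1 and k = 2 can be made concrete with a small numerical sketch. It assumes the usual soft-margin form of the cost, one half of w · w plus C times the sum of the slacks raised to the power k, with each slack equal to max(0, 1 - stability) and the stability equal to the class times w · x. The exact normalization used in this paper may differ, and the function names and toy patterns are illustrative only.

```python
def slack(stability):
    # Assumed hinge-type slack: how far a pattern falls short of unit margin.
    return max(0.0, 1.0 - stability)


def smc_cost(w, patterns, classes, C, k):
    """Assumed soft-margin cost: 0.5 * w.w + C * sum(slack ** k)."""
    regularization = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, tau in zip(patterns, classes):
        stability = tau * sum(wi * xi for wi, xi in zip(w, x))
        penalty += slack(stability) ** k
    return regularization + C * penalty


if __name__ == "__main__":
    w = [1.0, 0.0]
    far_error = ([-3.0, 0.0], 1)    # misclassified, far from the hyperplane
    near_error = ([0.5, 0.0], 1)    # correctly classified, inside the margin
    for name, (x, tau) in (("far", far_error), ("near", near_error)):
        for k in (1, 2):
            print(name, "k =", k, smc_cost(w, [x], [tau], C=1.0, k=k))
```

In this toy case the distant misclassified pattern has slack 4 and contributes 4 to the penalty for k = 1 but 16 for k = 2, whereas the pattern just inside the margin has slack 0.5 and contributes 0.5 and 0.25 respectively. Distant errors therefore dominate the cost much more strongly when k = 2.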

In the case of an SMC learning patterns whose classes are entirely 
random, the best learning performance is achieved for k = 1. This is 
not surprising, as this case is similar to the rules of type I, because as 
the classes are random, the errors made by the classifier are evidently 
unbounded. 

For the unrealizable rules considered, the convergence of the 
training and generalization errors to their asymptotic values as a 
function of $\alpha$ in the limit $\alpha \gg 1$ follows a power law 
decay with exponent 1/2 if one of the teacher's hyperplanes contains 
the origin, and an exponential decay otherwise. If there is a 
Gaussian additive noise, only the power law decay exists. It would be 
interesting to determine if these two types of asymptotic behaviour 
are universal for all the unrealizable rules or if there are still more 
possible regimes. It is remarkable that even though the asymptotic 
values of $\epsilon_t$ and $\epsilon_g$ can be larger than one half (which is worse than 
that achieved by randomly classifying the patterns), in the regime 
of small values of $\alpha$ we always have $\epsilon_t < 1/2$ and $\epsilon_g < 1/2$. 
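
One way to tell the two regimes apart in simulated learning curves is to plot the excess error (the error minus its asymptotic value) on a log-log scale, where a power law appears as a straight line whose slope is minus the exponent. The sketch below fits that slope by least squares. It uses synthetic data with hypothetical prefactors (0.3 and 0.2), not results from this paper, and the function name is illustrative.

```python
import math


def loglog_slope(alphas, excess_errors):
    """Least-squares slope of log(excess error) versus log(alpha)."""
    xs = [math.log(a) for a in alphas]
    ys = [math.log(e) for e in excess_errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    alphas = [10.0, 20.0, 40.0, 80.0]
    # Synthetic power-law decay with exponent 1/2.
    power_law = [0.3 * a ** -0.5 for a in alphas]
    # Synthetic exponential decay, for comparison.
    exponential = [math.exp(-0.2 * a) for a in alphas]
    print("power law slope:  ", loglog_slope(alphas, power_law))
    print("exponential slope:", loglog_slope(alphas, exponential))
```

For the synthetic power law the fitted slope is -1/2 up to rounding, while the exponential decay gives a much steeper slope that keeps steepening as larger values of alpha are included.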

The statistical mechanics approach can be extended to consider 
more complicated (and probably more realistic) pattern distribu- 
tions, like biased or non-gaussian distributions. The bias b can be 
included, but the calculations become much more complicated. If a 
bias is allowed, probably the asymptotic exponential behavior men- 
tioned above would disappear, because in this case the hyperplane 
of the classifier can be shifted until it coincides with one of the hy- 
perplanes of the considered rule. Classifiers using k = 3 in the cost 
function can also be studied within this approach, in much the same 
way as k = 1 and k = 2, and they may present some interesting features, 
even though they are not used in practice. The model can be 
extended to include the mappings from the input space to a feature 
space, but at the expense of considerably increasing the complexity 
of the calculations. 